[V1][Core] Add a cache hit threshold for requests #24520

kfirwolfson wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.
I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.
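A guard along these lines would avoid the division by zero for empty prompts (a minimal sketch with hypothetical names, not the PR's actual code):

```python
def passes_cache_hit_threshold(
    num_computed_tokens: int,
    num_prompt_tokens: int,
    threshold: float,
) -> bool:
    """Return True if the request may proceed (hypothetical helper).

    A threshold of 0.0 disables gating. Treating an empty prompt as a
    pass avoids the ZeroDivisionError noted above.
    """
    if threshold <= 0.0 or num_prompt_tokens == 0:
        return True
    hit_ratio = num_computed_tokens / num_prompt_tokens
    return hit_ratio >= threshold
```

For example, `passes_cache_hit_threshold(28, 100, 0.33)` returns `False`, matching the "cache hit rate 0.28 < threshold 0.33" rejection logged in the test results.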
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default; you can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
@robertgshaw2-redhat self tag
This pull request has merge conflicts that must be resolved before it can be merged.
Updated after merge of #26813. Would appreciate help in reviewing @markmc, @njhill, @tlrmchlsmth |
I'm sorry we have been slow to give feedback on this. And all I have at this stage are some very high-level comments ...

For the decode preemption case - I think it's most natural to look at this first through the same lens as

For the "decode-first P/D" case - this is introducing an alternative KV transfer protocol flow, an additional step before the current prefill-first flow kicks in, and might e.g. lead to additional

Both of these make sense, but I'm wary of the specifics of the proposal:
I think I'd be more immediately supportive of doing this in baby-steps, each of which could be a standalone PR:
Thanks for the detailed discussion points, @markmc, answering inline.
We think it’s a bit too strict to decide “no prefill at all” in the Decode instance. There are cases in which a little prefill on the Decode node makes sense, as exemplified below.
Not sure what you are referring to in “additional kv_transfer_params in the prefill request”. The “decode-first P/D” flow is an alternative to the current prefill-first flow. If there is enough cache, the Decoder will handle the request and there will be no remote prefill request (and no kv_transfer_params returned from the Prefiller). If the cache hit is low, then the flow continues in a similar way to the current prefill-flow: send the request to the Prefiller and then send it to the Decoder with the kv_transfer_params the Prefiller returned.
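The decode-first flow described here could be sketched from the router's side roughly as follows (hypothetical helper names and flattened response dicts for illustration; the real orchestration lives in the router/sidecar, not in vLLM):

```python
def route_decode_first(request: dict, send) -> dict:
    """Hypothetical router/sidecar logic for the decode-first P/D flow.

    `send(instance, request)` posts the request to a vLLM instance and
    returns a (simplified, flattened) response dict.
    """
    # 1) Try the Decoder first, gated by a per-request threshold.
    resp = send("decoder", {**request, "cache_hit_threshold": 0.8})
    if resp["finish_reason"] != "cache_threshold":
        # Enough local cache: the Decoder handled the request directly;
        # no remote prefill and no kv_transfer_params involved.
        return resp

    # 2) Low cache hit: fall back to the usual prefill-first flow.
    prefill = send("prefiller", {**request, "cache_hit_threshold": 0.0})
    retry = {
        **request,
        "cache_hit_threshold": 0.0,  # must run on the Decoder now
        "kv_transfer_params": prefill["kv_transfer_params"],
    }
    return send("decoder", retry)
```

The gating return of `finish_reason="cache_threshold"` is what lets the router distinguish "handled locally" from "needs remote prefill" without any new protocol messages.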
Not sure if you mean “KV Connector” or “KV Transfer”, as they are not exactly interchangeable. The feature can be useful for just KV-offloading using KV-Connectors as well (without direct transfers). One example is described under “Other Scenarios” in the RFC description, where cache is offloaded to local DRAM or disks but not transferred, and there is no P/D-D. A more practical use-case is a PD-D scenario with a shared KV Cache reachable by both Prefillers and Decoders, without direct KV Transfers between the two.
As mentioned above and described in the RFC, some prefill work on the Decoder may be acceptable. This is also in “happy paths” and not just for Preemption – some connectors for example move data at the token-block or chunk of blocks resolution. In those cases, the cache will not contain the tail of the request (beyond block/chunk boundaries), which will be calculated using prefill on the Decoder.
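For example, with a block-granular connector the restorable prefix stops at the last full block boundary, leaving a short tail to prefill on the Decoder (illustrative numbers; 16 is a typical vLLM block size):

```python
BLOCK_SIZE = 16        # tokens per KV-cache block (assumed value)
prompt_tokens = 1000

# Whole blocks only: the restorable cached prefix ends at a block boundary.
cached_tokens = (prompt_tokens // BLOCK_SIZE) * BLOCK_SIZE  # 992
tail_tokens = prompt_tokens - cached_tokens                 # 8

hit_ratio = cached_tokens / prompt_tokens                   # 0.992
# Any threshold at or below 0.992 admits the request, and the Decoder
# prefills only the 8-token tail rather than the whole prompt.
```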
Decode first is a router/sidecar flow, not something vLLM needs to be aware of. vLLM doesn’t really have anything to do with this information, as it just handles incoming requests as they come. The global or per-request threshold allows the router/sidecar to control the flow (the diagram in the RFC shows this). A flag would not suffice; we need a tunable threshold.
We believe the tunable threshold per request gives the highest flexibility. Without it the system is fixed and the threshold has to be determined at loading time. Thinking about it some more, if we want to reduce complexity, we can decide to avoid a global threshold altogether and use only per-request thresholds. An example usage for this flow can be seen in the parallel work now being done for adding Decode-first support to llm-d; see "feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present". If you want, we would love to set up some time to review the suggestion as a whole.
Hi @kfirwolfson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Fix Gemini CR comments
Add unit tests
Move from SamplingParams to request
unit test remake
fix static code analysis rejects
Fix unit test
fix after local CR
fix pre-commit reject
add threshold to request logger and fix some calls to encode
fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>

…oject#32726 review

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
[V1][Core] Add a cache hit threshold for requests
Purpose
Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests that are unlikely to benefit from prefill in P/D disaggregated deployments.
Edit: an additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests involving P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, now including all the (possibly many) output tokens generated so far. Tests in the field showed this case leads to Decoders executing prefills and eventually locking up. The main problem is that the external router (such as llm-d / Dynamo / Production Stack) orchestrating P/D has no control over this vLLM behavior once the Decode instance receives the request. Setting a small cache hit-rate threshold on the request (say 0.001) will reject this prefill work in case of preemption, and the request will be sent back to the calling router / sidecar / worker.
What this PR adds
- New engine argument --global-cache-hit-threshold ([0.0–1.0], default 0.0).
- New optional per-request field cache_hit_threshold ([0.0–1.0]) in ChatCompletionRequest/CompletionRequest (validated in the protocol layer).
- New finish reason "cache_threshold" exposed via the v1 engine API. Requests rejected by this gating return HTTP 200 with finish_reason="cache_threshold" and no output tokens.
- Plumbed through VllmConfig and SchedulerConfig; thresholds are validated to lie in [0.0, 1.0].

Why
Backwards compatibility
0.0 → feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan
1) Unit Tests
Unit tests check the scheduler logic, including
2) E2E manual tests
Run vllm serve with the --global-cache-hit-threshold 0.8 argument to set a default value. We'll override it in most requests.

The scheduler computes hit_ratio = computed_tokens / prompt_tokens.

We will send 4 requests. Note that the order of sending them matters, as the first request fills the cache the others depend on:

- cache_hit_threshold: 0, so it's guaranteed to execute and populate the KV-cache
- cache_hit_threshold: 0.33
- no cache_hit_threshold field, which means the global value of 0.8 will take effect

Request 1) Warm the cache
This run uses cache_hit_threshold: 0, so it's guaranteed to execute and populate the KV-cache for the base segment.

Request 2) MISS case
Expected: HTTP 200 with "finish_reason": "cache_threshold".

Request 3) HIT case
Expected: normal generation ("finish_reason" is not "cache_threshold").

Request 4) MISS case using global threshold
Use the global threshold set to 0.8.

Expected: HTTP 200 with "finish_reason": "cache_threshold".

Notes
Test Result
E2E Local smoke tests on a single node:
- Rejected requests return HTTP 200 with finish_reason: "cache_threshold" and empty outputs.
- Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
- Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
- Also tested threshold edge values 0.0 and 1.0 (not detailed above)
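For reference, the rejected per-request MISS case above was driven by request bodies of this general shape (model name and prompt text are placeholders, not the ones used in the tests):

```python
import json

# Hypothetical /v1/completions body for the per-request MISS case: the
# prompt shares only a short prefix with the warmed cache, so the computed
# hit ratio falls below the per-request threshold and the request is
# rejected with finish_reason "cache_threshold".
payload = {
    "model": "placeholder-model",
    "prompt": "shared warm prefix ... plus a long uncached continuation",
    "max_tokens": 32,
    "cache_hit_threshold": 0.33,  # per-request field added by this PR
}
print(json.dumps(payload, indent=2))
```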